Finding the most interesting patterns in a database quickly by using sequential sampling

ABSTRACT

Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on confidence and quality of solutions. Known sampling approaches have treated single hypothesis selection problems, assuming that the utility is the average (over the examples) of some function—which is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and the resulting worst-case sample bounds for some of the most frequently used utilities, and prove that, for one popular class of utility functions that cannot be estimated with bounded error, no sampling algorithm can exist. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.

This application is the national phase under 35 U.S.C. § 371 of PCT International Application No. PCT/EP01/09541 which has an International filing date of Aug. 18, 2001, which designated the United States of America.

The invention relates to finding the most interesting patterns in a database quickly by using sequential sampling, in particular to a method for sampling a database for obtaining the probably approximately n best hypotheses having the highest empirical utility of a group of potential hypotheses.

1 Introduction

The general task of knowledge discovery in databases (KDD) is the “automatic extraction of novel, useful, and valid knowledge from large sets of data”. An important aspect of this task is scalability, i.e., the ability to successfully perform discovery in ever-growing datasets. Unfortunately, even with discovery algorithms optimized for very large datasets, for many application problems it is infeasible to process all of the given data. Whenever more data is available than can be processed in reasonable time, an obvious strategy is to use only a randomly drawn sample of the data. Clearly, if parts of the data are not looked at, it is impossible in general to guarantee that the results produced by the discovery algorithm will be identical to the results returned on the complete dataset. If the use of sampled datasets is to be more than a practitioner's “hack”, sampling must be combined with discovery algorithms in a fashion that allows us to give the user guarantees about how far the results obtained using sampling differ from the optimal (non-sampling based) results. The goal of a sampling discovery algorithm then is to guarantee this quality using the minimum number of examples [5].

Known algorithms that do give rigorous guarantees on the quality of the returned solutions for all possible problems usually require an impractically large amount of data. One approach to finding practical algorithms is to process a fixed amount of data but determine the possible strength of the quality guarantee dynamically, based on characteristics of the data; this is the idea of self-bounding learning algorithms and shell decomposition bounds. Another approach (which we pursue) is to demand a certain fixed quality and determine the required sample size dynamically based on characteristics of the data that have already been seen; this idea has originally been referred to as sequential analysis.

In the machine learning context, the idea of sequential sampling has been developed into the Hoeffding race algorithm [3] which processes examples incrementally, updates the empirical utility values simultaneously, and starts to output (or discard) hypotheses as soon as it becomes very likely that some hypothesis is near-optimal (or very poor, respectively). The incremental greedy learning algorithm Palo [2] has been reported to require many times fewer examples than the worst-case bounds suggest. In the context of knowledge discovery in databases, too, sequential sampling algorithms can reduce the required amount of data significantly [1].

These existing sampling algorithms address discovery problems where the goal is to select from a space of possible hypotheses H one of the elements with maximal value of an instance-averaging quality function f, or all elements with an f-value above a user-given threshold (e.g., all association rules with sufficient support). With instance-averaging quality functions, the quality of a hypothesis h is the average across all instances in a dataset D of an instance quality function f_(inst).

Many discovery problems, however, cannot easily be cast in this framework. Firstly, it is often more natural for a user to ask for the n best solutions instead of the single best or all hypotheses above a threshold—see, e.g., [7]. Secondly, many popular quality measures cannot be expressed as an averaging quality function. This is the case, e.g., for all functions that combine generality and distributional properties of a hypothesis; generally, both generality and distributional properties (such as accuracy) have to be considered for association rule and subgroup discovery problems. The task of subgroup discovery is to find maximally general subsets of database transactions within which the distribution of a focused feature differs maximally from the default probability of that feature in the whole database. As an example, consider the problem of finding groups of customers who are particularly likely (or unlikely) to buy a certain product.

In this paper, we present a general sampling algorithm for the n-best hypotheses problem that works for any utility function that can be estimated with bounded error at all. To this end, in Section 2, we first define the n-best hypotheses problem more precisely and identify appropriate quality guarantees. Section 3 then presents the generic sequential sampling algorithm. In Section 4, we prove that many of the popular utility functions that have been used in the area of knowledge discovery in databases can indeed be estimated with bounded error, giving detailed bounds. In order to motivate the instantiations of our sampling algorithm and put it into context, we first define some relevant knowledge discovery tasks at the beginning of Section 4. For one popular class of functions that cannot be used by our algorithm, we prove that there cannot be a sampling algorithm at all. Our results thus also give an indication as to which of the large number of popular utility functions are preferable with respect to sampling. In Section 6, we evaluate our results and discuss their relation to previous work.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a plurality of graphs comparing sample sizes for non-sequential and sequential sampling on a database.

FIG. 2 shows a plurality of graphs comparing sample sizes for non-sequential and sequential sampling on a different database.

2 Approximating n-Best Hypotheses Problems

In many cases, it is more natural for a user to ask for the n best solutions instead of the single best or all hypotheses above a threshold. Such n-best hypotheses problems can be stated more precisely as follows—adapted from [7], where this formulation is used for subgroup discovery:

Definition 1 (n-best hypotheses problem) Let D be a database of instances, H a set of possible hypotheses, f: H×D→ℝ≧0 a quality or utility function on H, and n, 1≦n<|H|, the number of desired solutions. The n-best hypotheses problem is to find a set G⊂H of size n such that there is no h′∈H: h′∉G and f(h′,D)>f_(min), where f_(min):=min_(h∈G) f(h,D).

Whenever we use sampling, the above optimality property cannot be guaranteed, so we must find appropriate alternative guarantees. Since for n-best problems, the exact quality and rank of hypotheses is often not central to the user, it is sufficient to guarantee that G indeed “approximately” contains the n best hypotheses. We can operationalize this by guaranteeing that there will never be a non-returned hypothesis that is “significantly” better than the worst hypothesis in our solution. More precisely, we will use the following problem formulated along the lines of PAC (probably approximately correct) learning:

Definition 2 (Approximate n-best hypotheses problem) Let D, H, f and n be as in the preceding definition. Then let δ, 0<δ≦1, be a user-specified confidence, and ε∈ℝ⁺ a user-specified maximal error. The approximate n-best hypotheses problem is to find a set G⊂H of size n such that, with confidence 1−δ, there is no h′∈H: h′∉G and f(h′,D)>f_(min)+ε, where f_(min):=min_(h∈G) f(h,D).

In other words, we want to find a set of n hypotheses such that, with high confidence, no other hypothesis outperforms any one of them by more than ε, where f is an arbitrary performance measure.

In order to design an algorithm for this problem, we need to make certain assumptions about the quality function f. Ideally, an algorithm should be capable of working (at least) with the kinds of quality functions that have already proven themselves useful in practical applications. If the problem is to classify database items (i.e., to find a total function mapping database items to class labels), accuracy is often used as utility criterion. For the discovery of association rules, by contrast, one usually relies on generality as primary utility criterion. Finally, for subgroup discovery, it is commonplace to combine both generality and distributional unusualness, resulting in relatively complex evaluation functions.

In light of the large range of existing and possible future utility functions and in order to avoid unduly restricting our algorithm, we will not make syntactic assumptions about f. In particular, unlike [1], we will not assume that f is a single probability nor that it is (a function of) an average of instance properties. Instead, we only assume that it is possible to determine a confidence interval for f that bounds the possible difference between true utility (on the whole database) and estimated utility (on the sample) with a certain confidence. We expect the confidence interval to narrow as the sample size grows. As we will show in Section 4 below, finding such confidence intervals is straightforward for classification accuracy, and is also possible for all but one of the popular utility functions from association rule and subgroup discovery. More precisely, we define a confidence interval for f as follows.

Definition 3 (Utility confidence interval) Let f be a utility function, let h∈H be a hypothesis. Let f(h,D) denote the true quality of h on the entire dataset, f̂(h,Q_(m)) its estimated quality computed based on a sample Q_(m)⊂D of size m. Then E: ℕ×ℝ→ℝ is a utility confidence bound for f iff for any δ, 0<δ≦1,

$$\Pr_S\left[\left|\hat{f}(h, Q_m) - f(h, D)\right| \leq E(m, \delta)\right] \geq 1 - \delta \qquad (1)$$

Equation 1 says that E provides a two-sided confidence interval on f̂(h,Q_(m)) with confidence 1−δ. In other words, the probability of drawing a sample Q_(m) (when drawing m transactions independently and identically distributed from D) such that true and estimated utility of a hypothesis differ by E(m,δ) or more (in either direction) lies below δ. If, in addition, for any δ, 0<δ≦1, and any ε>0 there is a number m such that E(m,δ)≦ε, we say that the confidence interval vanishes. In this case, we can shrink the confidence interval (at any confidence level δ) to arbitrarily low nonzero values by using a sufficiently large sample. We sometimes write the confidence interval for a specific hypothesis h as E_(h)(m,δ). Thus, we allow the confidence interval to depend on characteristics of h, such as the variance of one or more random variables that the utility of h depends on.

We will discuss confidence intervals for different functions of interest in Section 4; here, as a simple example, let us only note that if f is simply a probability over the examples, then we can use the Chernoff inequality to derive a confidence interval; when f is the average (over the examples) of some function with bounded range, then the Hoeffding inequality implies a confidence interval. Of course, we should also note that the trivial function E(m,δ):=Λ is an error probability bound function for any f with lower bound of zero and upper bound of Λ, but we will see that we can only guarantee termination when the confidence interval vanishes as the sample size grows.
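As a small illustration (not part of the original text), the sketch below computes the Hoeffding-based confidence bound that will appear as Equation 4 in Section 4.1; for a probability (Λ=1) this is the Chernoff case, and the interval visibly vanishes as m grows.

```python
import math

def hoeffding_interval(m: int, delta: float, value_range: float = 1.0) -> float:
    """Two-sided utility confidence bound E(m, delta) from the Hoeffding
    inequality: with probability at least 1 - delta, an average of m bounded
    instance utilities deviates from its expectation by at most this amount
    (cf. Equation 4 below)."""
    return math.sqrt(value_range ** 2 / (2 * m) * math.log(2 / delta))

# The interval vanishes with growing sample size: for delta = 0.05 and range 1,
# E(100, 0.05) is roughly 0.136 and E(10000, 0.05) roughly 0.0136.
for m in (100, 1000, 10000):
    print(m, round(hoeffding_interval(m, 0.05), 4))
```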

3 Sampling Method According to the Invention

According to the invention there is provided a method for sampling a database for obtaining the probably approximately n best hypotheses having the highest empirical utility of a group of potential hypotheses, comprising the steps of (see also Table 1 of this specification):

- a) generating all possible hypotheses based on specifications of a user as remaining hypotheses,
- b) checking whether enough data points were sampled so far to distinguish within the group of potential hypotheses all good looking hypotheses from bad looking hypotheses with sufficient confidence,
- c) if according to step b) enough data points were sampled so far then continue with step j), otherwise continue with step d),
- d) sampling one or more data points,
- e) calculating the utility of all the remaining hypotheses on the basis of the sampled data points,
- f) in the set of remaining hypotheses, determining a set of good looking hypotheses by taking the n hypotheses that currently have the highest utility based on the data points sampled so far, whereas the other hypotheses are added to a set of bad looking hypotheses,
- g) checking each of the remaining hypotheses, wherein
    - i) if the currently considered hypothesis with sufficient probability is sufficiently good when compared to all bad looking hypotheses, then
        - outputting the currently considered hypothesis,
        - removing the currently considered hypothesis from the set of remaining hypotheses and from the set of good looking hypotheses,
        - decrementing the number of hypotheses still to be found, and
        - if the number of hypotheses still to be found is zero or the number of remaining hypotheses is equal to the number of hypotheses still to be found, then continue with step j),
    - ii) if the currently considered hypothesis with sufficient probability is sufficiently bad when compared with all the good looking hypotheses, then
        - removing the currently considered hypothesis from the set of bad looking hypotheses and from the set of remaining hypotheses, and
        - if the number of remaining hypotheses is equal to the number of hypotheses still to be found, then continue with step j),
- h) continue with step b),
- j) outputting to a user the n best hypotheses having the highest utility of all the remaining hypotheses.

In one embodiment in steps a), g)i), and g)ii) the following step k) is performed:

- k) determining a utility confidence interval given a required maximum probability of error and a sample size, based on the observed variance of each hypothesis' utility value.

In another embodiment in step a) the following step l) is performed (see step 2. of Table 1):

- l) based on step k), determining the size of the utility confidence interval for the current data point sample size and the probability of error as selected by the user and dividing the size of the utility confidence interval by two and by the number of remaining hypotheses, wherein enough data points are sampled if the number as calculated above is smaller than the user selected maximum margin of error divided by two.

In still another aspect of the invention in step g)i) the following steps are performed (see step 3.(e)i. of Table 1):

- m) determine a locally allowed error probability based on the probability of error selected by the user divided by two and the number of remaining hypotheses and the smallest data point sample size that makes the error confidence interval according to step k) smaller than or equal to the user selected maximum error margin when determined based on the user selected error probability divided by two and the size of the initial set of hypotheses,
- n) for each of the bad looking hypotheses, determine the sum of its observed utility value and its error confidence interval size as determined according to step k) for the current data point sample size and the locally allowed error probability according to step m),
- o) selecting the maximum value of all the sums determined in step n),
- p) subtracting the user selected maximum error margin from the value selected in step o),
- q) adding to the number obtained in step p) the size of the error confidence interval as determined according to step k) for the hypothesis currently considered for outputting, based on the current data point sample size and the locally allowed error probability,
- r) if the number determined according to steps m) to q) is no larger than the observed utility of the hypothesis currently considered for outputting and this hypothesis is a good looking one, then this hypothesis is considered sufficiently good with sufficient probability as required in step g)i).

In a further aspect of the invention in step g)ii) the following steps are performed (see step 3(e)ii. of Table 1):

- s) determine a locally allowed error probability based on the probability of error selected by the user divided by two and the number of remaining hypotheses and the smallest data point sample size that makes the error confidence interval according to step k) smaller than or equal to the user selected maximum error margin when determined based on the user selected error probability divided by two and the size of the initial set of hypotheses,
- t) for each good looking hypothesis, determine the difference between its observed utility value and its error confidence interval as determined in step k) based on the current data point sample size and the locally allowed error probability as determined in step s),
- u) selecting the minimum of all the differences determined in step t),
- v) subtract from the value selected in step u) the size of the utility confidence interval of the hypothesis considered for removal, as determined according to step k) and the locally allowed error probability as determined in step s),
- w) if the number determined in steps s) to v) is not smaller than the observed utility of the hypothesis considered for removal, then this hypothesis is considered sufficiently bad with sufficient probability as required in step g)ii).

The general approach to designing a sampling method is to use an appropriate error probability bound to determine the required number of examples for a desired level of confidence and accuracy. When estimating a single probability, Chernoff bounds that are used in PAC theory and many other areas of statistics and computer science can be used to determine appropriate sample bounds [5]. When such algorithms are implemented, the Chernoff bounds can be replaced by tighter normal or t distribution tables.

Unfortunately, the straightforward extension of such approaches to selection or comparison problems like the n-best hypotheses problem leads to unreasonably large bounds: to avoid errors in the worst case, we have to take very large samples to recognize small differences in utility, even if the actual differences between the hypotheses to be compared are very large. This problem is addressed by sequential sampling methods (which have also been referred to as adaptive sampling methods [1]). The idea of sequential sampling is that when a difference between two frequencies is very large after only a few examples, then we can conclude that one of the probabilities is greater than the other with high confidence; we need not wait for the sample size specified by the Chernoff bound, as we have to when the frequencies are similar. Sequential sampling methods have been reported to reduce the required sample size by several orders of magnitude (e.g., [2]).
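To get a feeling for the size of such fixed worst-case bounds, the following short calculation (illustrative, not from the specification) computes the non-sequential Chernoff/Hoeffding sample size for estimating a single probability to within ε with confidence 1−δ, which is required regardless of how clearly the hypotheses are separated.

```python
import math

def chernoff_sample_size(epsilon: float, delta: float) -> int:
    """Smallest m with sqrt(ln(2/delta) / (2m)) <= epsilon, i.e. the fixed
    worst-case sample size for estimating one probability to accuracy epsilon."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# e.g. epsilon = 0.01 and delta = 0.05 already require about 18,445 examples.
print(chernoff_sample_size(0.01, 0.05))
```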

In our method (see Table 1), we combine sequential sampling with the popular “loop reversal” technique found in many KDD algorithms. Instead of processing hypotheses one after another, and obtaining enough examples for each hypothesis to evaluate it sufficiently precisely, we keep obtaining examples (step 3b) and apply these to all remaining hypotheses simultaneously (step 3c). This strategy allows the algorithm to be easily implemented on top of database systems (assuming they are capable of drawing samples), and enables us to reach tighter bounds. After the statistics of each remaining hypothesis have been updated, the algorithm checks all remaining hypotheses and (step 3(e)i) outputs those where it can be sufficiently certain that the number of better hypotheses is no larger than the number of hypotheses still to be found (so they can all become solutions), or (step 3(e)ii) discards those hypotheses where it can be sufficiently certain that the number of better other hypotheses is at least the number of hypotheses still to be found (so it can be sure the current hypothesis does not need to be in the solutions). When the algorithm has gathered enough information to distinguish the good hypotheses that remain to be found from the bad ones with sufficient probability, it exits in step 3.

Indeed it can be shown that this strategy leads to a total error probability less than δ, as required.

TABLE 1
Sequential sampling algorithm for the n-best hypotheses problem

Algorithm Generic Sequential Sampling.
Input: n (number of desired hypotheses), ε and δ (approximation and confidence parameters).
Output: n approximately best hypotheses (with confidence 1−δ).

1. Let n₁ = n (the number of hypotheses that we still need to find) and let H₁ = H (the set of hypotheses that have, so far, neither been discarded nor accepted). Let Q₀ = ∅ (no sample drawn yet). Let i = 1 (loop counter).
2. Let M be the smallest number such that $E\left(M, \frac{\delta}{2|H|}\right) \leq \frac{\varepsilon}{2}$.
3. Repeat until $n_i = 0$ Or $|H_{i+1}| = n_i$ Or $E\left(i, \frac{\delta}{2|H_i|}\right) \leq \frac{\varepsilon}{2}$:
   (a) Let H_(i+1) = H_(i).
   (b) Query a random item of the database q_(i). Let Q_(i) = Q_(i−1) ∪ {q_(i)}.
   (c) Update the empirical utility f̂ of the hypotheses in H_(i).
   (d) Let H_(i)* be the n_(i) hypotheses from H_(i) which maximize the empirical utility f̂.
   (e) For h ∈ H_(i) While n_(i) > 0 And |H_(i)| > n_(i):
       i. If $\hat{f}(h, Q_i) \geq E_h\left(i, \frac{\delta}{2M|H_i|}\right) + \max_{h_k \in H_i \setminus H_i^*}\left\{\hat{f}(h_k, Q_i) + E_{h_k}\left(i, \frac{\delta}{2M|H_i|}\right)\right\} - \varepsilon$ And h ∈ H_(i)* (h appears good) Then Output hypothesis h and then Delete h from H_(i+1) and let n_(i+1) = n_(i) − 1. Let H_(i)* be the new set of empirically best hypotheses.
       ii. Else If $\hat{f}(h, Q_i) \leq \min_{h_k \in H_i^*}\left\{\hat{f}(h_k, Q_i) - E_{h_k}\left(i, \frac{\delta}{2M|H_i|}\right)\right\} - E_h\left(i, \frac{\delta}{2M|H_i|}\right)$ (h appears poor) Then Delete h from H_(i+1). Let H_(i)* be the new set of empirically best hypotheses.
   (f) Increment i.
4. Output the n_(i) hypotheses from H_(i) which have the highest empirical utility.
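The sketch below mirrors the control flow of Table 1 in Python. It is an illustrative reading, not the patented embodiment: the hypothesis representation and the callbacks `draw_example`, `empirical_utility`, and `conf_bound` (playing the roles of f̂ and E_h, e.g., one of the bounds derived in Section 4) are placeholders to be supplied by a concrete instantiation.

```python
from typing import Callable, Hashable, List, Optional, Sequence

def generic_sequential_sampling(
    hypotheses: Sequence[Hashable],
    draw_example: Callable[[], object],
    empirical_utility: Callable[[Hashable, List[object]], float],
    conf_bound: Callable[[Optional[Hashable], int, float], float],
    n: int,
    epsilon: float,
    delta: float,
) -> List[Hashable]:
    """Sketch of the Generic Sequential Sampling algorithm of Table 1.
    conf_bound(h, m, d) plays the role of E_h(m, d); calling it with h = None
    stands for the hypothesis-independent bound E(m, d) of steps 2 and 3."""
    H = list(hypotheses)          # H_i: hypotheses neither accepted nor discarded
    Q: List[object] = []          # sample drawn so far
    output: List[Hashable] = []   # hypotheses output so far
    n_i = n                       # number of hypotheses still to be found

    # Step 2: smallest M such that E(M, delta / (2|H|)) <= epsilon / 2.
    M = 1
    while conf_bound(None, M, delta / (2 * len(H))) > epsilon / 2:
        M += 1

    i = 0
    # Step 3: repeat until n_i = 0, |H_i| = n_i, or E(i, delta/(2|H_i|)) <= eps/2.
    while n_i > 0 and len(H) > n_i and (
        i == 0 or conf_bound(None, i, delta / (2 * len(H))) > epsilon / 2
    ):
        i += 1
        Q.append(draw_example())                          # step 3(b): extend sample
        util = {h: empirical_utility(h, Q) for h in H}    # step 3(c): update f_hat
        local_delta = delta / (2 * M * len(H))

        def split():                                      # step 3(d): current H_i*
            ranked = sorted(H, key=lambda h: util[h], reverse=True)
            return set(ranked[:n_i]), ranked[n_i:]

        best, rest = split()
        for h in list(H):                                 # step 3(e)
            if n_i == 0 or len(H) <= n_i:
                break
            if h in best and rest and util[h] >= conf_bound(h, i, local_delta) + max(
                util[hk] + conf_bound(hk, i, local_delta) for hk in rest
            ) - epsilon:
                output.append(h)      # 3(e)i: h appears good, so output and remove it
                H.remove(h)
                n_i -= 1
                best, rest = split()
            elif h not in best and util[h] <= min(
                util[hk] - conf_bound(hk, i, local_delta) for hk in best
            ) - conf_bound(h, i, local_delta):
                H.remove(h)           # 3(e)ii: h appears poor, so discard it
                best, rest = split()

    # Step 4: fill the remaining slots with the empirically best hypotheses.
    H.sort(key=lambda h: empirical_utility(h, Q) if Q else 0.0, reverse=True)
    return output + H[:n_i]
```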

Theorem 1 The algorithm will output a group G of exactly n hypotheses such that, with confidence 1−δ, no other hypothesis in H has a utility which is more than ε higher than the utility of any hypothesis that has been returned:

$$\Pr\left[\exists h \in H \setminus G: f(h) > f_{min} + \varepsilon\right] \leq \delta \qquad (2)$$

where f_(min)=min_(h′∈G){f(h′)}; assuming that |H|≧n.

The proof of Theorem 1 can be found in Appendix A.

Theorem 2 (Termination) If for any δ (0<δ≦1) and ε>0 there is a number m such that E(m,δ)≦ε, then the algorithm can be guaranteed to terminate.

Correctness of Theorem 2 follows immediately from Step 3e of the algorithm. Theorem 2 says that we can guarantee termination if the confidence interval vanishes for large numbers of examples. This is a rather weak assumption that is satisfied by most utility functions, as we will see in the next section.

4 Instantiations

In order to implement the algorithm for a given utility function we have to find a utility confidence interval E(m,δ) that satisfies Equation 1 for that specific f. In this section, we will introduce some terminology, and present a list of confidence intervals for the utility functions that are most commonly used in knowledge discovery systems. Since the database is constant, we abbreviate f(h,D) as f(h) throughout this section.

Most of the known utility functions refer to confidence, accuracy, “statistical unusualness”, support or generality of hypotheses. Let us quickly put these terms into perspective. Association rules and classification rules are predictive; for some database transaction they predict the value of an attribute given the values of some other attributes. For instance, the rule “beer=1→chips=1” predicts that a customer transaction with attribute beer=1 will also likely have the attribute chips=1. However, when a customer does not buy beer, then the rule does not make any prediction. In particular, the rule does not imply that a customer who does not buy beer does not buy chips either. The number of transactions in the database for which the rule makes a correct prediction (in our example, the number of transactions which include beer and chips) is called the support, or the generality.

Among those transactions for which the rule does make a prediction, some predictions may be erroneous. The confidence is the fraction of correct predictions among those transactions for which a prediction is made. The accuracy, too, quantifies the probability of a hypothesis conjecturing a correct attribute. However, the term accuracy is typically used in the context of classification and refers to the probability of a correct classification for a future transaction whereas the confidence refers to the database (i.e., the training data). From a sampling point of view, confidence and accuracy can be treated equally. In both cases, a relative frequency is measured on a small sample; from this frequency we want to derive claims on the underlying probability. It does not make a difference whether this probability is itself a frequency on a much larger instance space (confidence) or a “real” probability (accuracy), defined with respect to an underlying distribution on instances.

Subgroups are of a more descriptive character. They describe that the value of an attribute differs from the global mean value within a particular subgroup of transactions without actually conjecturing the value of that attribute for a new transaction. The generality of a subgroup is the fraction of all transactions in the database that belong to that subgroup. The term statistical unusualness refers to the difference between the probability p₀ of an attribute in the whole database and the probability p of that attribute within the subgroup. Usually, subgroups are desired to be both general (large g) and statistically unusual (large |p−p₀|). There are many possible utility functions for subgroup discovery which trade generality against unusualness. Unfortunately, none of these functions can be expressed as the average (over all transactions) of an instance utility function. But, in Sections 4.2 through 4.4 we will show how instantiations of the GSS algorithm can solve sampling problems for these functions.

We would like to conclude this subsection with a remark on whether a sample should be drawn with or without replacement. When the utility function is defined with respect to a finite database, it is, in principle, possible to draw the sample without replacement. When the sample size reaches the database size, we can be certain to have solved the real, not just the approximate, n-best hypotheses problem. So it should be possible to give a tighter utility confidence bound when the sample is drawn without replacement. Consider the simple case when the utility is a probability. When the sample is drawn with replacement, the relative frequency corresponding to the target probability is governed by the binomial distribution whereas, when the sample is drawn without replacement, it is governed by the hypergeometric distribution and we can specify a tighter bound. However, for sample sizes in the order of magnitude that we envision, the only feasible way of calculating both the hypergeometric and the binomial distribution is to use a normal approximation. But the normal approximations of both distributions are equal, and so we cannot realize the small advantage that drawing without replacement seems to promise. The same situation arises with other utility functions.

4.1 Instance-Averaging Functions

The simplest form of a utility function is the average, over all example instances, of some instance utility function f_(inst)(h,q_(i)) where q_(i)∈D. The utility is then defined as

$$f(h) = \frac{1}{|D|}\sum_{i=1}^{|D|} f_{inst}(h, q_i)$$

(the average over the whole database) and the estimated utility is

$$\hat{f}(h, Q_m) = \frac{1}{m}\sum_{q_i \in Q_m} f_{inst}(h, q_i)$$

(the average over the example queries). An easy example of an instance-averaging utility is classification accuracy (where f_(inst)(h,q_(i)) is 0 or 1). Besides being useful by itself, this class of utility functions serves as an introductory example of how confidence intervals can be derived. We assume that the possible range of utility values lies between 0 and Λ. In the case of classification accuracy, Λ equals one.
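For concreteness, here is a small sketch (illustrative names only) of classification accuracy as an instance-averaging utility, with f̂(h,Q_m) computed as the mean instance utility over the drawn sample.

```python
from typing import Callable, Sequence

# Hypothetical instance utility for classification accuracy: 1 if the
# hypothesis predicts the example's label correctly, 0 otherwise.
def accuracy_inst(predict: Callable[[dict], int], example: dict) -> float:
    return 1.0 if predict(example) == example["label"] else 0.0

def estimated_utility(predict: Callable[[dict], int], sample: Sequence[dict]) -> float:
    """f_hat(h, Q_m): the average instance utility over the drawn sample."""
    return sum(accuracy_inst(predict, q) for q in sample) / len(sample)

# Toy rule "predict label 1 iff beer == 1" evaluated on a tiny sample: 1/3.
sample = [{"beer": 1, "label": 1}, {"beer": 0, "label": 1}, {"beer": 1, "label": 0}]
print(estimated_utility(lambda q: 1 if q["beer"] == 1 else 0, sample))
```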

We can use the Hoeffding inequality to bound the chance that an arbitrary (bounded) random variable X takes a value which is far away from its expected value E(X) (Equation 3). When X is a relative frequency and E(X) the corresponding probability, then we know that Λ=1. This special case of the Hoeffding inequality is called Chernoff's inequality.

$$\Pr\left[\left|X - E(X)\right| \leq \varepsilon\right] \geq 1 - 2\exp\left\{-2m\frac{\varepsilon^2}{\Lambda^2}\right\} \qquad (3)$$

We now need to define a confidence interval that satisfies Equation 1, where the Hoeffding inequality serves as a tool to prove Equation 1. We can easily see that Equation 4 satisfies this condition.

$$E(m,\delta) = \sqrt{\frac{\Lambda^2}{2m}\log\frac{2}{\delta}} \qquad (4)$$

In Equation 5 we insert Equation 4 into Equation 1. We apply the Hoeffding inequality (Equation 3) in Equation 6 and obtain the desired result in Equation 7.

$$\begin{aligned}
\Pr\left[\left|\hat{f}(h,Q_m) - f(h)\right| > E(m,\delta)\right] &= \Pr\left[\left|\hat{f}(h,Q_m) - f(h)\right| > \sqrt{\frac{\Lambda^2}{2m}\log\frac{2}{\delta}}\right] && (5)\\
&\leq 2\exp\left\{-2m\frac{\left(\sqrt{\frac{\Lambda^2}{2m}\log\frac{2}{\delta}}\right)^2}{\Lambda^2}\right\} && (6)\\
&= 2\exp\left\{-\log\frac{2}{\delta}\right\} = \delta && (7)
\end{aligned}$$

For implementation purposes, the Hoeffding inequality is less suited since it is not very tight. For large m, we can replace the Hoeffding inequality by the normal distribution, referring to the central limit theorem. f̂(h,Q_(m))−f(h) is a random variable with mean value 0; we further know that f̂(h,Q_(m)) is bounded between zero and Λ. In order to calculate the normal distribution, we need to refer to the true variance of our random variable. In step 3, the variance is not known since we do not refer to any particular hypothesis. We can only bound the variance from above and thus obtain a confidence interval E which is tighter than Hoeffding's/Chernoff's inequality and still satisfies Equation 1. f̂(h,Q_(m)) is the average of m values, namely

$$\frac{1}{m}\sum_{i=1}^{m} \hat{f}_{inst}(h, q_i).$$

The empirical variance

$$s_{\hat{f}(h,Q_m) - f(h)} = \frac{1}{m}\sqrt{\sum_{i=1}^{m}\left(\hat{f}_{inst}(h, q_i) - \hat{f}(h, Q_m)\right)^2}$$

is maximized when

$$\hat{f}(h, Q_m) = \frac{\Lambda}{2}$$

and the individual f_(inst)(h,q_(i)) are zero for half the instances q_(i) and Λ for the other half of all instances. In this case,

$$s \leq \frac{\Lambda}{2\sqrt{m}}.$$

Consequently,

$$\frac{2\sqrt{m}\left(\hat{f}(h, Q_m) - f(h)\right)}{\Lambda}$$

is governed by the standard normal distribution, which implies that Equation 8 satisfies Equation 1. z is the inverse standard normal distribution that can be looked up in a table.

$$E(m,\delta) = z_{1-\frac{\delta}{2}} \cdot \frac{\Lambda}{2\sqrt{m}} \qquad (8)$$

In steps 3(e)i and 3(e)ii, we refer to specific hypotheses h and can therefore determine the empirical variance of f̂(h,Q_(m)). We can define E_(h)(m,δ) as in Equation 10.

$$\begin{aligned}
E_h(m,\delta) &= z_{1-\frac{\delta}{2}} \cdot s_h && (9)\\
&= z_{1-\frac{\delta}{2}}\,\frac{1}{m}\sqrt{\sum_{i=1}^{m}\left(f_{inst}(h, q_i) - \hat{f}(h, Q_m)\right)^2} && (10)
\end{aligned}$$

Note that we have simplified the situation a little. We have conflated the true variance σ (the average squared distance from the true mean f(h)) and the empirical variance s_(h) in Equation 10. The empirical variance possesses one degree of freedom less than the true variance and, to be quite accurate, we would have to refer to Student's t distribution rather than the normal distribution. Empirically, we observed that the algorithm does not start to output or discard any hypotheses until the sample size has reached the order of a hundred. In this region, Student's distribution can well be approximated by the normal distribution and we can keep this treatment (and the implementation) simple.
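A small sketch of how Equations 9 and 10 could be evaluated in an implementation (illustrative code, using Python's standard-library inverse normal):

```python
from statistics import NormalDist
from typing import Sequence

def normal_conf_bound(inst_utilities: Sequence[float], delta: float) -> float:
    """E_h(m, delta) per Equation 10: z_{1-delta/2} times the empirical term
    (1/m) * sqrt(sum_i (f_inst(h, q_i) - f_hat(h, Q_m))^2)."""
    m = len(inst_utilities)
    f_hat = sum(inst_utilities) / m
    spread = sum((x - f_hat) ** 2 for x in inst_utilities) ** 0.5 / m
    return NormalDist().inv_cdf(1 - delta / 2) * spread

# e.g. 200 observed 0/1 instance utilities, 120 of them correct: about 0.068.
print(round(normal_conf_bound([1.0] * 120 + [0.0] * 80, 0.05), 4))
```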

Let us now determine a worst-case bound on m (the number of queries that our sampling algorithm issues). The algorithm exits the loop (at the latest) when

$$E\left(m, \frac{\delta}{2|H|}\right) \leq \frac{\varepsilon}{2}.$$

We can show that this is the case with certainty when

$$m \geq \frac{2\Lambda^2}{\varepsilon^2}\log\frac{4|H|}{\delta}.$$

In Equation 11, we expand our definition of E. The Λ and log-terms cancel out in Equation 12; we can bound the confidence interval by ε/2 in Equation 13 as required for the algorithm to exit in step 3e.

$$\begin{aligned}
E\left(\frac{2\Lambda^2}{\varepsilon^2}\log\frac{4|H|}{\delta},\, \frac{\delta}{2|H|}\right) &= \sqrt{\frac{\Lambda^2}{2\left(\frac{2\Lambda^2}{\varepsilon^2}\log\frac{4|H|}{\delta}\right)}\log\frac{2}{\left(\frac{\delta}{2|H|}\right)}} && (11)\\
&= \sqrt{\frac{\varepsilon^2}{4}\cdot\frac{\log\frac{4|H|}{\delta}}{\log\frac{4|H|}{\delta}}} && (12)\\
&= \frac{\varepsilon}{2} && (13)
\end{aligned}$$

But note that our algorithm will generally terminate much earlier; firstly, because we use the normal distribution (for large m) rather than the Hoeffding approximation and, secondly, because our sequential sampling approach will terminate much earlier when the n best hypotheses differ considerably from many of the “bad” hypotheses. The worst case occurs only when all hypotheses in the hypothesis space are equally good, which makes it much more difficult to identify the n best ones.
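As an illustration of the magnitude of this worst-case bound (the numbers below are example values, not from the text):

```python
import math

def worst_case_m(value_range: float, epsilon: float, delta: float, num_hypotheses: int) -> int:
    """Sample size after which E(m, delta/(2|H|)) <= epsilon/2 is certain for
    instance-averaging utilities: m >= (2*Lambda^2/epsilon^2) * ln(4|H|/delta)."""
    return math.ceil(2 * value_range ** 2 / epsilon ** 2 * math.log(4 * num_hypotheses / delta))

# e.g. Lambda = 1, epsilon = 0.05, delta = 0.05, |H| = 1000: about 9,032 examples.
print(worst_case_m(1.0, 0.05, 0.05, 1000))
```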

4.2 Functions that are Linear in g and (p−p₀)

The first class of nontrivial utility functions that we study weights the generality g of a subgroup and the deviation of the probability p of a certain feature from the default probability p₀ equally. Hence, these functions multiply generality and distributional unusualness of subgroups. Alternatively, we can use the absolute distance |p−p₀| between probability p and default probability p₀. The multi-class version of this function is

$$g\,\frac{1}{c}\sum_{i=1}^{c}\left|p_i - p_{0_i}\right|$$

where p_(0_i) is the default probability for class i.

Theorem 3 Let

1. f(h) = g(p−p₀) and f̂(h,Q) = ĝ(p̂−p₀), or
2. f(h) = g|p−p₀| and f̂(h,Q) = ĝ|p̂−p₀|, or
3. f(h) = g·(1/c)Σ_{i=1}^{c}|p_i−p_{0_i}| and f̂(h,Q) = ĝ·(1/c)Σ_{i=1}^{c}|p̂_i−p_{0_i}|.

Then Pr[|f̂(h,Q_m) − f(h)| ≤ E(m,δ)] ≥ 1−δ when

$$\begin{aligned}
\text{small } m:\quad E(m,\delta) &= 3\sqrt{\frac{1}{2m}\log\frac{4}{\delta}} && (14)\\
\text{large } m:\quad E(m,\delta) &= \frac{z_{1-\frac{\delta}{4}}}{\sqrt{m}} + \frac{\left(z_{1-\frac{\delta}{4}}\right)^2}{4m} && (15)\\
E_h(m,\delta) &= z_{1-\frac{\delta}{4}}\left(s_g + s_p + z_{1-\frac{\delta}{4}}\, s_g s_p\right) && (16)
\end{aligned}$$

Proof. (3.1) In Equation 17, we insert Equation 14 into Equation 1. We refer to the union bound in Equation 18. Then, we exploit that ε²≦ε for ε≦1 in Equation 19. The simple observation that g≦1 and (p−p₀)≦1 leads to Equation 20. Equations 21 and 22 are based on elementary transformations. In Equation 23, we refer to the union bound again. The key observation here is that ab cannot be greater than (c+ε)(d+ε) unless at least a>c+ε or b>d+ε. The Chernoff inequality (which is a special case of the Hoeffding inequality 3 for Λ=1) takes us to Equation 24.

$$\begin{aligned}
\Pr\left[\left|\hat{f}(h,Q_m) - f(h)\right| > E(m,\delta)\right] &= \Pr\left[\left|\hat{g}(\hat{p}-p_0) - g(p-p_0)\right| > 3\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right] && (17)\\
&\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) - g(p-p_0) > 3\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right] && (18)\\
&\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) - g(p-p_0) > 2\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}} + \left(\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)^2\right] && (19)\\
&\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) - g(p-p_0) > g\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}} + (p-p_0)\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}} + \left(\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)^2\right] && (20)\\
&\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) - g(p-p_0) > \left(g + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\left(p-p_0 + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right) - g(p-p_0)\right] && (21)\\
&\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) > \left(g + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\left(p-p_0 + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\right] && (22)\\
&\leq 4\Pr\left[\hat{q} > q + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right] = 4\Pr\left[\hat{q} - q > \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right] && (23)\\
&\leq 4\exp\left\{-2m\tfrac{1}{2m}\log\tfrac{4}{\delta}\right\} = \delta && (24)
\end{aligned}$$

Let us now prove the normal approximation of the above confidence bound. We start in Equation 25 by inserting Equation 16 into Equation 1. s_(g) and s_(p) denote the variances of g and p, respectively. We also cover Equation 15 in this proof. The variances can be bounded from above:

$$s_g,\ s_p \leq \frac{1}{2\sqrt{m}}.$$

Hence,

$$z_{1-\frac{\delta}{4}}\left(s_g + s_p + z_{1-\frac{\delta}{4}}\, s_g s_p\right) \leq 2 z_{1-\frac{\delta}{4}}\frac{1}{2\sqrt{m}} + \left(z_{1-\frac{\delta}{4}}\frac{1}{2\sqrt{m}}\right)^2 \leq \frac{3 z_{1-\frac{\delta}{4}}}{2\sqrt{m}}.$$

We expand the definition of f and apply the union bound in Equation 25. Equation 26 follows from g≦1 and p−p₀≦1, and Equation 27 is just a factorization of

$$g + z_{1-\frac{\delta}{4}}\, s_g.$$

Again, note that ab cannot be greater than (c+ε)(d+ε) unless a>c+ε or b>d+ε. Applying the union bound in Equation 28 and the normal bounds in Equation 29 proves the claim.

$$\begin{aligned}
&\Pr\left[\left|\hat{f}(h,Q_m) - f(h)\right| > z_{1-\frac{\delta}{4}}\left(s_g + s_p + z_{1-\frac{\delta}{4}}\, s_g s_p\right)\right]\\
&\quad\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) - g(p-p_0) > z_{1-\frac{\delta}{4}}\left(s_g + s_p + z_{1-\frac{\delta}{4}}\, s_g s_p\right)\right] && (25)\\
&\quad\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) - g(p-p_0) > g\, z_{1-\frac{\delta}{4}} s_g + (p-p_0)\, z_{1-\frac{\delta}{4}} s_p + \left(z_{1-\frac{\delta}{4}}\right)^2 s_g s_p\right] && (26)\\
&\quad\leq 2\Pr\left[\hat{g}(\hat{p}-p_0) > \left(g + z_{1-\frac{\delta}{4}} s_g\right)\left(p-p_0 + z_{1-\frac{\delta}{4}} s_p\right)\right] && (27)\\
&\quad\leq 2\left(\Pr\left[\hat{g} > g + z_{1-\frac{\delta}{4}} s_g\right] + \Pr\left[\hat{p} > p + z_{1-\frac{\delta}{4}} s_p\right]\right) && (28)\\
&\quad\leq 2\left(\frac{\delta}{4} + \frac{\delta}{4}\right) = \delta && (29)
\end{aligned}$$

This completes the proof for Theorem (3.1).

(3.2) Instead of having to estimate p, we need to estimate the random variable |p−p₀|. We define s_(p) to be the empirical variance of |p−p₀|. Since this value is bounded between zero and one, all the arguments which we used in the last part of this proof apply analogously.

(3.3) Here, the random variable is

$$\frac{1}{c}\sum_{i=1}^{c}\left|p_i - p_{0_i}\right|.$$

This variable is also bounded between zero and one and so the proof is analogous to case (3.1). This completes the proof.
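As an illustration of how the quantities in Equation 16 might be obtained from a sample, the sketch below computes s_g and s_p from hypothetical 0/1 observations per sampled transaction (whether it belongs to the subgroup, and whether it has the focused feature); treating p̂ as estimated on the covered transactions is one plausible reading, not a statement of the original method.

```python
from statistics import NormalDist
from typing import Sequence

def empirical_spread(xs: Sequence[float]) -> float:
    """s = (1/m) * sqrt(sum_i (x_i - mean)^2); this is the quantity bounded by
    1/(2*sqrt(m)) in the text."""
    m = len(xs)
    mean = sum(xs) / m
    return sum((x - mean) ** 2 for x in xs) ** 0.5 / m

def linear_utility_bound(in_subgroup: Sequence[int], has_feature: Sequence[int], delta: float) -> float:
    """E_h(m, delta) of Equation 16 for f(h) = g * (p - p0):
    z_{1-delta/4} * (s_g + s_p + z_{1-delta/4} * s_g * s_p)."""
    z = NormalDist().inv_cdf(1 - delta / 4)
    s_g = empirical_spread([float(s) for s in in_subgroup])
    # Assumption: p is estimated from the transactions covered by the subgroup.
    covered = [float(f) for s, f in zip(in_subgroup, has_feature) if s == 1]
    if not covered:
        return float("inf")  # no covered transactions yet: interval not informative
    s_p = empirical_spread(covered)
    return z * (s_g + s_p + z * s_g * s_p)
```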

Theorem 4 For all functions f(h) covered by Theorem 3, the sampling algorithm will terminate after at most

$$m = \frac{18}{\varepsilon^2}\log\frac{8|H|}{\delta} \qquad (30)$$

database queries (but usually much earlier).

Proof. The algorithm terminates in step 3 when

$$E\left(i, \frac{\delta}{2|H|}\right) \leq \frac{\varepsilon}{2}.$$

We will show that this is always the case when

$$i \geq m = \frac{18}{\varepsilon^2}\log\frac{8|H|}{\delta}.$$

We insert the sample bound (Equation 30) into the definition of E for linear functions (Equation 14); after the log-terms cancel in Equation 31 we obtain the desired bound of ε/2.

$$\begin{aligned}
E\left(\frac{18}{\varepsilon^2}\log\frac{8|H|}{\delta},\, \frac{\delta}{2|H|}\right) &= 3\sqrt{\frac{1}{2\left(\frac{18}{\varepsilon^2}\log\frac{8|H|}{\delta}\right)}\log\frac{4}{\left(\frac{\delta}{2|H|}\right)}} && (31)\\
&\leq 3\sqrt{\frac{\varepsilon^2}{36}} = \frac{\varepsilon}{2} && (32)
\end{aligned}$$

This completes the proof.
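A quick numeric check of this bound (illustrative values only):

```python
import math

def linear_bound_m(epsilon: float, delta: float, num_hypotheses: int) -> int:
    """Worst-case sample size of Theorem 4: m = (18/epsilon^2) * ln(8|H|/delta)."""
    return math.ceil(18 / epsilon ** 2 * math.log(8 * num_hypotheses / delta))

def small_m_interval(m: int, delta: float) -> float:
    """E(m, delta) of Equation 14 for the linear utility functions."""
    return 3 * math.sqrt(math.log(4 / delta) / (2 * m))

epsilon, delta, H = 0.1, 0.05, 1000
m = linear_bound_m(epsilon, delta, H)
# At this m, the exit test E(m, delta/(2|H|)) <= epsilon/2 indeed holds.
print(m, small_m_interval(m, delta / (2 * H)) <= epsilon / 2)
```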

4.3 Functions with Squared Terms

Squared terms [7] are introduced to put more emphasis on the difference between p and the default probability.

Theorem 5 Let

1. f(h) = g²(p−p₀) and f̂(h,Q) = ĝ²(p̂−p₀), or
2. f(h) = g²|p−p₀| and f̂(h,Q) = ĝ²|p̂−p₀|, or
3. f(h) = g²·(1/c)Σ_{i=1}^{c}|p_i−p_{0_i}| and f̂(h,Q) = ĝ²·(1/c)Σ_{i=1}^{c}|p̂_i−p_{0_i}|.

Then Pr[|f̂(h,Q_m) − f(h)| ≤ E(m,δ)] ≥ 1−δ when

$$\begin{aligned}
\text{small } m:\quad E(m,\delta) &= \left(\frac{1}{2m}\log\frac{4}{\delta}\right)^{\frac{3}{2}} + 3\left(\frac{1}{2m}\log\frac{4}{\delta}\right) + 3\sqrt{\frac{1}{2m}\log\frac{4}{\delta}} && (33)\\
\text{large } m:\quad E(m,\delta) &= \frac{3}{2\sqrt{m}}\, z_{1-\frac{\delta}{2}} + \frac{m+\sqrt{m}}{4m\sqrt{m}}\left(z_{1-\frac{\delta}{2}}\right)^2 + \frac{1}{8m\sqrt{m}}\left(z_{1-\frac{\delta}{2}}\right)^3 && (34)\\
E_h(m,\delta) &= 2 s_g z_{1-\frac{\delta}{2}} + s_g^2\left(z_{1-\frac{\delta}{2}}\right)^2 + s_p z_{1-\frac{\delta}{2}} + 2 s_g s_p\left(z_{1-\frac{\delta}{2}}\right)^2 + s_p s_g^2\left(z_{1-\frac{\delta}{2}}\right)^3 && (35)
\end{aligned}$$

Proof. (5.1) f(h)=g²(p−p₀). As usual, we start in Equation 36 by combining the definition of E (Equation 33) with Equation 1, which specifies the property of E that we would like to prove. In Equation 37 we exploit that g≦1 and p−p₀≦1. In Equation 38 we add g²(p−p₀) to both sides of the inequality and start factorizing. In Equation 39 we have identified three factors. The observation that a²b cannot be greater than (c+ε)(c+ε)(d+ε) unless at least a>c+ε or b>d+ε and the union bound lead to Equation 40; the Chernoff inequality completes this part of the proof.

$$\begin{aligned}
&\Pr\left[\left|\hat{f}(h,Q_m) - f(h)\right| > E(m,\delta)\right]\\
&\quad= \Pr\left[\left|\hat{g}^2(\hat{p}-p_0) - g^2(p-p_0)\right| > \left(\tfrac{1}{2m}\log\tfrac{4}{\delta}\right)^{\frac{3}{2}} + 3\left(\tfrac{1}{2m}\log\tfrac{4}{\delta}\right) + 3\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right] && (36)\\
&\quad\leq 2\Pr\left[\hat{g}^2(\hat{p}-p_0) - g^2(p-p_0) > g^2\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}} + 2g(p-p_0)\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}} + 2g\left(\tfrac{1}{2m}\log\tfrac{4}{\delta}\right) + (p-p_0)\left(\tfrac{1}{2m}\log\tfrac{4}{\delta}\right) + \left(\tfrac{1}{2m}\log\tfrac{4}{\delta}\right)^{\frac{3}{2}}\right] && (37)\\
&\quad\leq 2\Pr\left[\hat{g}^2(\hat{p}-p_0) > \left(g^2 + 2g\sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}} + \tfrac{1}{2m}\log\tfrac{4}{\delta}\right)\left(p-p_0 + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\right] && (38)\\
&\quad\leq 2\Pr\left[\hat{g}^2(\hat{p}-p_0) > \left(g + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\left(g + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\left(p-p_0 + \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right)\right] && (39)\\
&\quad\leq 2\left(\Pr\left[\hat{g} - g > \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right] + \Pr\left[\hat{p} - p > \sqrt{\tfrac{1}{2m}\log\tfrac{4}{\delta}}\right]\right) && (40)\\
&\quad\leq 4\exp\left\{-2m\left(\tfrac{1}{2m}\log\tfrac{4}{\delta}\right)\right\} = \delta && (41)
\end{aligned}$$

Let us now look at the normal approximation. First, we will make sure that Equation 34 is a special case of Equation 35 (variance bounded from above). The variance of both g and p is at most

$$\frac{1}{2\sqrt{m}}.$$

This takes us from Equation 35 to Equation 42. Equation 43 equals Equation 34.

$$\begin{aligned}
&2 s_g z_{1-\frac{\delta}{2}} + s_g^2\left(z_{1-\frac{\delta}{2}}\right)^2 + s_p z_{1-\frac{\delta}{2}} + 2 s_g s_p\left(z_{1-\frac{\delta}{2}}\right)^2 + s_p s_g^2\left(z_{1-\frac{\delta}{2}}\right)^3\\
&\quad\leq \frac{1}{\sqrt{m}}\, z_{1-\frac{\delta}{2}} + \frac{1}{4m}\left(z_{1-\frac{\delta}{2}}\right)^2 + \frac{1}{2\sqrt{m}}\, z_{1-\frac{\delta}{2}} + \frac{1}{4\sqrt{m}}\left(z_{1-\frac{\delta}{2}}\right)^2 + \frac{1}{8m\sqrt{m}}\left(z_{1-\frac{\delta}{2}}\right)^3 && (42)\\
&\quad= \frac{3}{2\sqrt{m}}\, z_{1-\frac{\delta}{2}} + \frac{m+\sqrt{m}}{4m\sqrt{m}}\left(z_{1-\frac{\delta}{2}}\right)^2 + \frac{1}{8m\sqrt{m}}\left(z_{1-\frac{\delta}{2}}\right)^3 && (43)
\end{aligned}$$

In Equation 44 we want to see if the normal approximation (Equation 35) satisfies the requirement of Equation 1. We add g²(p−p₀) to both sides of the equation and start factorizing the right hand side of the inequality in Equations 45 and 46. The union bound takes us to Equation 47; Equation 48 proves the claim.

$$\begin{aligned}
&\Pr\left[\left|\hat{f}(h,Q_m) - f(h)\right| > E_h(m,\delta)\right]\\
&\quad= \Pr\left[\left|\hat{g}^2(\hat{p}-p_0) - g^2(p-p_0)\right| > 2 s_g z_{1-\frac{\delta}{2}} + s_g^2\left(z_{1-\frac{\delta}{2}}\right)^2 + s_p z_{1-\frac{\delta}{2}} + 2 s_g s_p\left(z_{1-\frac{\delta}{2}}\right)^2 + s_p s_g^2\left(z_{1-\frac{\delta}{2}}\right)^3\right] && (44)\\
&\quad\leq 2\Pr\left[\hat{g}^2(\hat{p}-p_0) > \left(g^2 + 2 g s_g z_{1-\frac{\delta}{2}} + s_g^2\left(z_{1-\frac{\delta}{2}}\right)^2\right)\left(p-p_0 + s_p z_{1-\frac{\delta}{2}}\right)\right] && (45)\\
&\quad\leq 2\Pr\left[\hat{g}^2(\hat{p}-p_0) > \left(g + s_g z_{1-\frac{\delta}{2}}\right)\left(g + s_g z_{1-\frac{\delta}{2}}\right)\left(p-p_0 + s_p z_{1-\frac{\delta}{2}}\right)\right] && (46)\\
&\quad\leq 2\left(\Pr\left[\hat{g} - g > s_g z_{1-\frac{\delta}{2}}\right] + \Pr\left[\hat{p} - p > s_p z_{1-\frac{\delta}{2}}\right]\right) && (47)\\
&\quad\leq 2\left(\frac{\delta}{4} + \frac{\delta}{4}\right) = \delta && (48)
\end{aligned}$$

This proves case (5.1). For cases (5.2) and (5.3), note that the random variables |p−p₀| and

$$\frac{1}{c}\sum_{i=1}^{c}\left|p_i - p_{0_i}\right|$$

(both bounded between zero and one) play the role of p and the proof is analogous to the first case (5.1).

Theorem 6 For all functions f(h) covered by Theorem 5, the sampling algorithm will terminate after at most

$$m = \frac{98}{\varepsilon^2}\log\frac{8|H|}{\delta} \qquad (49)$$

database queries (but usually much earlier).

Proof. The algorithm terminates in step 3 when

$$E\left(i, \frac{\delta}{2|H|}\right) \leq \frac{\varepsilon}{2}.$$

The utility functions of Theorem 5 are bounded between zero and one. Hence, we can assume that ε≦1 since otherwise the algorithm might just return n arbitrarily poor hypotheses and still meet the requirements of Theorem 1. This means that the algorithm cannot exit until

$$E\left(m, \frac{\delta}{2|H_i|}\right) \leq \frac{1}{2}$$

(or n hypotheses have been returned). For

$$E\left(m, \frac{\delta}{2|H_i|}\right)$$

to be ½ or less, each of the three terms in Equation 33 has to be below 1. Note that if ε<1 then ε²<ε. We can therefore bound E as in Equation 51.

$$\begin{aligned}
E(m,\delta) &= \left(\frac{1}{2m}\log\frac{4}{\delta}\right)^{\frac{3}{2}} + 3\left(\frac{1}{2m}\log\frac{4}{\delta}\right) + 3\sqrt{\frac{1}{2m}\log\frac{4}{\delta}} && (50)\\
&< 7\sqrt{\frac{1}{2m}\log\frac{4}{\delta}} && (51)
\end{aligned}$$

Now we will show that E lies below ε/2 when m reaches the bound described in Equation 49. We insert the sample bound into the exit criterion in Equation 52. The log-terms cancel and the result is ε/2 as desired.

$$\begin{aligned}
E\left(\frac{98}{\varepsilon^2}\log\frac{8|H|}{\delta},\, \frac{\delta}{2|H|}\right) &< 7\sqrt{\frac{1}{2\left(\frac{98}{\varepsilon^2}\log\frac{8|H|}{\delta}\right)}\log\frac{4}{\left(\frac{\delta}{2|H|}\right)}} && (52)\\
&\leq 7\sqrt{\frac{\varepsilon^2}{4\cdot 49}} = \frac{\varepsilon}{2} && (53)
\end{aligned}$$

This completes the proof.

4.4 Functions Based on the Binomial Test

The Binomial test heuristic is based on elementary considerations. Suppose that the probability p is really equal to p₀ (i.e., the corresponding subgroup is really uninteresting). How likely is it that the subgroup with generality g displays a frequency p̂ on the sample Q with an even greater difference |p̂−p₀|? For large |Q|×g, (p̂−p₀) is governed by the normal distribution with mean value of zero and variance at most

$$\frac{1}{2\sqrt{m}}.$$

The probability density function of the normal distribution is monotonic, and so the resulting confidence is order-equivalent to √m(p−p₀) (m being the support), which is, up to a constant factor, equivalent to √g(p−p₀).

Several variants of this utility function have been used.

Theorem 7 Let

1. f(h) = √g(p−p₀) and f̂(h,Q) = √ĝ(p̂−p₀), or
2. f(h) = √g|p−p₀| and f̂(h,Q) = √ĝ|p̂−p₀|, or
3. f(h) = √g·(1/c)Σ_{i=1}^{c}|p_i−p_{0_i}| and f̂(h,Q) = √ĝ·(1/c)Σ_{i=1}^{c}|p̂_i−p_{0_i}|.

Then Pr[|f̂(h,Q_m) − f(h)| ≤ E(m,δ)] ≥ 1−δ when

$$\begin{aligned}
\text{small } m:\quad E(m,\delta) &= \sqrt{\frac{1}{2m}\log\frac{4}{\delta}} + \sqrt[4]{\frac{1}{2m}\log\frac{4}{\delta}} + \left(\frac{1}{2m}\log\frac{4}{\delta}\right)^{\frac{3}{4}} && (54)\\
\text{large } m:\quad E(m,\delta) &= \sqrt{\frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}}} + \frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}} + \left(\frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}}\right)^{\frac{3}{2}} && (55)\\
E_h(m,\delta) &= \sqrt{s_g z_{1-\frac{\delta}{4}}} + s_p z_{1-\frac{\delta}{4}} + \sqrt{s_g z_{1-\frac{\delta}{4}}}\, s_p z_{1-\frac{\delta}{4}} && (56)
\end{aligned}$$

Proof. (7.1) In Equation 57, we insert Equation 54 into Equation 1 (the definition of E). In Equation 58 we weaken the right hand side, exploiting that $\sqrt{g}\leq 1$ and $p-p_{0}\leq 1$.

As usual, we factor the right hand side of the inequality in Equation 59 and use the union bound in Equation 60. In Equation 61 we weaken the inequality a little; note that $\sqrt[4]{x}\geq\sqrt{\sqrt{x}-y}$ when $y>0$.

Hence, subtracting the lengthy term in Equation 61 decreases the right hand side and can therefore only increase the probability of the inequality (which we want to bound from above). The reason why we subtract this term is that we want to apply the binomial formula and factor $\sqrt{g+\varepsilon}-\sqrt{g}$, with $\varepsilon$ standing for the confidence term $\sqrt{\frac{1}{2m}\log\frac{4}{\delta}}$.

We do this in the following steps 62 and 63, which are perhaps a little hard to check without a computer algebra system. Adding $\sqrt{g}$ and squaring both sides of the inequality leads to Equation 64; the Chernoff inequality then yields the desired bound of δ.

Pr [f̂(h, Q_(m)) − f(h) > E(m, δ)] $\begin{matrix}{= {2{\Pr\left\lbrack {{{\sqrt{\hat{g}}\left( {\hat{p} - p_{0}} \right)} - {\sqrt{g}\left( {p - p_{0}} \right)}} > {\sqrt[2]{\frac{1}{2m}\log\;\frac{4}{\delta}} + \sqrt[4]{\frac{1}{2m}\log\;\frac{4}{\delta}} + \left( {\frac{1}{2m}\log\;\frac{4}{\delta}} \right)^{\frac{3}{4}}}} \right\rbrack}}} & (57) \\{\leq {2{\Pr\left\lbrack {{{\sqrt{\hat{g}}\left( {\hat{p} - p_{0}} \right)} - {\sqrt{g}\left( {p - p_{0}} \right)}} > {{\sqrt{g}\sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}} + {\left( {p - p_{0}} \right)\sqrt[4]{\frac{1}{2m}\log\;\frac{4}{\delta}}} + \left( {\frac{1}{2m}\log\;\frac{4}{\delta}} \right)^{\frac{3}{4}}}} \right\rbrack}}} & (58) \\{\leq {2{\Pr\left\lbrack {{\sqrt{\hat{g}}\left( {\hat{p} - p_{0}} \right)} > {\left( {\sqrt{g} + \sqrt[4]{\frac{1}{2m}\log\;\frac{4}{\delta}}} \right)\left( {p - p_{0} + \sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}} \right)}} \right\rbrack}}} & (59) \\{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt[4]{\frac{1}{2m}\log\;\frac{4}{\delta}}} \right\rbrack}} + {2{\Pr\left\lbrack {{\hat{p} - p} > \sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}} \right\rbrack}}}} & (60) \\{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt{\sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}} - {2\left( \sqrt{g^{2} + {g\sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}}} \right)}}} \right\rbrack}} + {2\exp\left\{ {{- 2}m\frac{1}{2m}\log\;\frac{4}{\delta}} \right\}}}} & (61) \\{= {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt{{2g} + \sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}} - {2\sqrt{g^{2} + \left( {g\sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}} \right)}}}} \right\rbrack}} + \frac{\delta}{2}}} & (62) \\{= {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > {\sqrt{g + \sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}} - \sqrt{g}}} \right\rbrack}} + \frac{\delta}{2}}} & (63) \\{= {{{2{\Pr\left\lbrack {{\hat{g} - g} > \sqrt{\frac{1}{2m}\log\;\frac{4}{\delta}}} \right\rbrack}} + \frac{\delta}{2}} = \delta}} & (64)\end{matrix}$

Now we still need to prove the normal approximations (Equations 55 and 56). As usual, we would like Equation 55 to be a special case of Equation 56 with the variances bounded from above. Equation 65 confirms that this is the case since

$s_{p},s_{g} \leq \frac{1}{2\sqrt{m}}$:

$$\begin{aligned}
\sqrt{s_{g}z_{1-\frac{\delta}{4}}}+s_{p}z_{1-\frac{\delta}{4}}+\sqrt{s_{g}z_{1-\frac{\delta}{4}}}\;s_{p}z_{1-\frac{\delta}{4}}
\;\leq\;
\sqrt{\frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}}}+\frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}}+\sqrt{\frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}}}\,\frac{z_{1-\frac{\delta}{4}}}{2\sqrt{m}} && (65)
\end{aligned}$$

This derivation is quite analogous to the previous one. We multiply the terms on the right hand side by factors which are less than or equal to one (Equation 67) and then factor the right hand side (Equation 68). We subtract a small number from

$s_{g}z_{1-\frac{\delta}{4}}$ in Equation 70 and factor $\sqrt{\hat{g}}-\sqrt{g}$ in Equations 71 and 72. Basic manipulations and the normal approximation complete the proof in Equation 73.

Pr [f̂(h, Q_(m)) − f(h) > E(m, δ)] $\begin{matrix}{\leq {2{\Pr\left\lbrack {{{\sqrt{\hat{g}}\left( {\hat{p} - p_{0}} \right)} - {\sqrt{g}\left( {p - p_{0}} \right)}} > {\sqrt{s_{g}z_{1 - \frac{\delta}{4}}} + {s_{p}z_{1 - \frac{\delta}{4}}} + {\sqrt{s_{g}z_{1 - \frac{\delta}{4}}}s_{p}z_{1 - \frac{\delta}{4}}}}} \right\rbrack}}} & (66) \\{\leq {2{\Pr\left\lbrack {{{\sqrt{\hat{g}}\left( {\hat{p} - p_{0}} \right)} - {\sqrt{g}\left( {p - p_{0}} \right)}} > {\sqrt{g\; s_{g}z_{1 - \frac{\delta}{4}}} - {\left( {p - p_{0}} \right)s_{p}z_{1 - \frac{\delta}{4}}} + {s_{g}z_{1 - \frac{\delta}{4}}\sqrt{s_{p}z_{1 - \frac{\delta}{4}}}}}} \right\rbrack}}} & (67) \\{\leq {2{\Pr\left\lbrack {{\sqrt{\hat{g}}\left( {\hat{p} - p_{0}} \right)} > {\left( {\sqrt{g} + \sqrt{s_{g}z_{1 - \frac{\delta}{4}}}} \right)\left( {p - p_{0} + {s_{p}z_{1 - {1\frac{\delta}{4}}}s_{p}}} \right)}} \right\rbrack}}} & (68) \\{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt{s_{g}z_{1 - \frac{\delta}{4}}}} \right\rbrack}} + {2{\Pr\left\lbrack {{\hat{p} - p} > {s_{p}z_{1 - \frac{\delta}{4}}}} \right\rbrack}}}} & (69) \\{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt{s + {g\; z_{1 - \frac{\delta}{4}}} - {2\left( {\sqrt{g^{2} + {{gs}_{g}z_{1 - \frac{\delta}{4}}}} + \sqrt{g^{2}}} \right)}}} \right\rbrack}} + \frac{\delta}{2}}} & (70) \\{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt{{2g} + {s_{g}\; z_{1 - \frac{\delta}{4}}} - {2\left( {g\left( {g + {s_{g}z_{1 - \frac{\delta}{4}}}} \right)} \right)}}} \right\rbrack}} + \frac{\delta}{2}}} & (71) \\{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > \sqrt{g + {s_{g}z_{1 - \frac{\delta}{4}}}}} \right\rbrack}} + \frac{\delta}{2}}} & (72) \\{{\leq {{2{\Pr\left\lbrack {{\sqrt{\hat{g}} - \sqrt{g}} > {s_{g}z_{1 - \frac{\delta}{4}}}} \right\rbrack}} + \frac{\delta}{2}}} = \delta} & (73)\end{matrix}$

This completes the proof for case (7.1). The proofs of cases (7.2) and (7.3) are analogous; instead of p we need to estimate

${{{p - p_{0}}}\mspace{14mu}{and}\mspace{14mu}\frac{1}{c}{\sum\limits_{i = 1}\left( {p_{i} - p_{0_{i}}} \right)}},$respectively. Both random variables are bounded between zero and one andso all our previous arguments apply. This completes the proof of Theorem7.

Theorem 8 For all functions f(h) covered by Theorem 7, the sampling algorithm will terminate after at most

$$\begin{aligned}
m = \frac{648}{\varepsilon^{4}}\log\frac{8|H|}{\delta} && (74)
\end{aligned}$$

database queries (but usually much earlier).

Proof. Proving the last theorem was a little tiring, so let us first see how we can find a simpler bound for E (Equation 54) which will help us to prove a sample size bound without having to think too hard. The middle term of Equation 54 dominates the expression since, for

$\varepsilon \leq 1$ it is true that $\sqrt[4]{\varepsilon} \geq \sqrt{\varepsilon} \geq \varepsilon^{\frac{3}{4}}$.

Hence, Equation 75 provides us with an easier bound.

$$\begin{aligned}
\sqrt{\frac{1}{2m}\log\frac{4}{\delta}}+\sqrt[4]{\frac{1}{2m}\log\frac{4}{\delta}}+\left(\frac{1}{2m}\log\frac{4}{\delta}\right)^{\frac{3}{4}} \;\leq\; 3\sqrt[4]{\frac{1}{2m}\log\frac{4}{\delta}} && (75)
\end{aligned}$$

The algorithm terminates in step 3 when

${E\left( {m,\frac{\delta}{2{H}}} \right)} \leq {\frac{ɛ}{2}.}$

Considering the sample bound in Equation 74, Equation 76 proves that this is guaranteed to be the case. Note that, since we bounded the confidence interval quite sloppily, we expect the algorithm to terminate considerably earlier.

$$\begin{aligned}
E\left(\frac{648}{\varepsilon^{4}}\log\frac{8|H|}{\delta},\;\frac{\delta}{2|H|}\right)
&< 3\sqrt[4]{\frac{1}{2\left(\frac{648}{\varepsilon^{4}}\log\frac{8|H|}{\delta}\right)}\log\frac{4}{\left(\frac{\delta}{2|H|}\right)}} && (76)\\
&\leq 3\sqrt[4]{\frac{\varepsilon^{4}}{16\cdot 81}} \;=\; \frac{\varepsilon}{2} && (77)
\end{aligned}$$

This completes the proof.
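
For comparison with the earlier bound, the following sketch evaluates Equation 74 numerically (helper name ours; natural logarithm assumed). The fourth power of ε makes this worst-case guarantee far more expensive than the bound of Equation 49, which is why sequential termination matters even more for these utility functions.

```python
from math import log, ceil

def binomial_test_sample_bound(epsilon, delta, num_hypotheses):
    """Equation 74: m = (648 / epsilon^4) * log(8|H| / delta)."""
    return ceil(648.0 / epsilon ** 4 * log(8.0 * num_hypotheses / delta))

# same setting as before: epsilon = 0.1, delta = 0.1, 288 hypotheses
print(binomial_test_sample_bound(0.1, 0.1, 288))   # on the order of 10^7 in the worst case
```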

4.5 Negative Results

Several independent impurity criteria have led to utility functions which are factor-equivalent to

$$f(h) = \frac{g}{1-g}\left(p-p_{0}\right)^{2};$$

e.g., the Gini diversity index, the twoing criterion, and the chi-square test. Note that it is also order-equivalent to the utility measure used in Inferrule. Unfortunately, this utility function is not bounded, and a few examples that have not been included in the sample can impose dramatic changes on the values of this function. This motivates our negative result.

Theorem 9 There is no algorithm that satisfies Theorem 1 when

$$f(h) = \frac{g}{1-g}\left(p-p_{0}\right)^{2}.$$

Proof. We need to show that $\hat{f}(h,Q_{m})-f(h)$ is unbounded for any finite m. This is easy since

$$\frac{g+\varepsilon}{1-(g+\varepsilon)} - \frac{g}{1-g}$$ goes to infinity when g approaches 1 or 1−ε (Equation 78).

$$\begin{aligned}
\frac{g+\varepsilon}{1-(g+\varepsilon)} - \frac{g}{1-g} = \frac{\varepsilon}{(g+\varepsilon-1)(g-1)} && (78)
\end{aligned}$$

This implies that, even after an arbitrarily large sample has been observed (that is smaller than the whole database), the utility of a hypothesis with respect to the sample can be arbitrarily far from the true utility. But one may argue that demanding $\hat{f}(h,Q)$ to be within an additive constant ε is overly restrictive. However, the picture does not change when we require $\hat{f}(h,Q)$ only to be within a multiplicative constant, since

$$\frac{g+\varepsilon}{1-(g+\varepsilon)} \Big/ \frac{g}{1-g}$$ goes to infinity when g+ε approaches 1 or g approaches zero (Equation 79).

$$\begin{aligned}
\frac{g+\varepsilon}{1-(g+\varepsilon)} \Big/ \frac{g}{1-g} = \frac{(g+\varepsilon)(1-g)}{g\,(1-g-\varepsilon)} && (79)
\end{aligned}$$

This means that no sample suffices to bound $\hat{f}(h,Q_{m})-f(h)$ with high confidence when a particular $\hat{f}(h,Q_{m})$ is measured. When a sampling algorithm uses all but very few database transactions as sample, the few remaining examples may still impose huge changes on the utility, which renders the use of sampling algorithms prohibitive. This completes the proof.
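
To illustrate the divergence numerically, the following sketch evaluates the difference of Equation 78 for a fixed ε as g approaches 1−ε (function name ours):

```python
def utility_gap(g, eps):
    """Difference of Equation 78: (g+eps)/(1-(g+eps)) - g/(1-g)."""
    return (g + eps) / (1.0 - (g + eps)) - g / (1.0 - g)

eps = 0.01
for g in (0.5, 0.9, 0.98, 0.989):
    print(g, round(utility_gap(g, eps), 2))   # grows without bound as g -> 1 - eps
```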

5 Experiments

In our experiments, we want to study the order of magnitude of examples which are required by our algorithm for realistic tasks. Furthermore, we want to measure how much of an improvement our sequential sampling algorithm achieves over a static sampling algorithm that determines the sample size with worst-case bounds.

We implemented a simple subgroup discovery algorithm. Hypotheses consist of conjunctions of up to k attribute value tests. For discrete attributes, we allow tests for any of the possible values (e.g., “color=green”); we discretize all continuous attributes and allow for testing whether the value of such attributes lies in an interval (e.g., “size ∈ [2.3, 5.8]”).
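
For concreteness, a minimal sketch of how such a hypothesis space could be enumerated; the attribute names, value sets, and the exclusion of repeated tests on the same attribute are our assumptions.

```python
from itertools import combinations

def single_tests(attributes):
    """attributes: dict mapping attribute name to its (discretized) values.
    Yields all attribute-value tests, e.g. ('color', 'green')."""
    for name, values in attributes.items():
        for value in values:
            yield (name, value)

def hypotheses(attributes, k):
    """All conjunctions of up to k tests, at most one test per attribute."""
    tests = list(single_tests(attributes))
    for size in range(1, k + 1):
        for conj in combinations(tests, size):
            if len({name for name, _ in conj}) == size:
                yield conj

attrs = {"color": ["green", "red"], "size": ["[0,2.3)", "[2.3,5.8]", "(5.8,inf)"]}
print(sum(1 for _ in hypotheses(attrs, 2)))   # 5 single tests + 6 pairs = 11
```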

We also implemented a non-sequential sampling algorithm in order to quantify the relative benefit of sequential sampling. The non-sequential algorithm determines a sample size M like our algorithm does in step 2, but using the full available error probability δ rather than only δ/2. Hence, the non-sequential sampling algorithm has a lower worst-case sample size than the sequential one, but it never exits or returns any hypothesis before that worst-case sample bound has been reached. Sequential and non-sequential sampling algorithms use the same normal approximation and come with identical guarantees on the quality of the returned solution.
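
A sketch of this non-sequential baseline under assumed helper functions (the sampling routine, the utility, and the worst-case sample size of step 2 are all placeholders):

```python
def non_sequential_n_best(draw_sample, hypotheses, utility, n, sample_bound):
    """Draw the full worst-case sample up front (computed with the whole error
    budget delta), evaluate every hypothesis once, return the n empirically best."""
    data = draw_sample(sample_bound)                     # sample_bound database queries
    ranked = sorted(hypotheses, key=lambda h: utility(h, data), reverse=True)
    return ranked[:n]
```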

For the first set of experiments, we used a database of 14,000 fruit juice purchase transactions. Each transaction is described by 29 attributes which specify properties of the purchased juice as well as attributes of the customer (e.g., age and job). The task is to identify subgroups of customers that differ from the overall average with respect to their preference for cans, recyclable bottles, or non-recyclable bottles. For this problem, we studied hypothesis spaces of size 288 (k=1, hypotheses test one attribute for a particular value), 37,717 (k=2, conjunctions of two tests), and 3,013,794 (k=3, conjunctions of three tests).

Since δ has only a minor (logarithmic) influence on the resulting sample size, all results presented in FIG. 1 were obtained with δ=0.1. We varied the utility function; the target attribute has three possible values, so we used the utility functions

$$f_{1} = g\,\frac{1}{3}\sum_{i=1}^{3}\big|p_{i}-p_{0_{i}}\big|,\qquad f_{2} = g^{2}\,\frac{1}{3}\sum_{i=1}^{3}\big|p_{i}-p_{0_{i}}\big|,\qquad\text{and}\qquad f_{3} = \sqrt{g}\,\frac{1}{3}\sum_{i=1}^{3}\big|p_{i}-p_{0_{i}}\big|.$$
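
These three functions could be computed from a subgroup's observed generality and class frequencies as in the following sketch (names and numbers are illustrative):

```python
from math import sqrt

def experiment_utilities(g, p, p0):
    """f1, f2, f3 for a target with three values: the mean absolute deviation of
    the class frequencies p from the defaults p0, weighted by g, g^2 and sqrt(g)."""
    dev = sum(abs(pi - p0i) for pi, p0i in zip(p, p0)) / len(p)
    return g * dev, g * g * dev, sqrt(g) * dev

# a subgroup with generality 0.2 whose class distribution deviates from the default
print(experiment_utilities(0.2, [0.5, 0.3, 0.2], [0.4, 0.35, 0.25]))
```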

FIG. 1 shows the sample size of the non-sequential algorithm as well as the sample size required before the sequential algorithm returned the first (out of ten) hypothesis and the sample size that the sequential algorithm required to return the last (tenth) hypothesis and terminate. In every single experiment that we ran, the sequential sampling algorithm terminated significantly earlier than the non-sequential one, even though the latter possesses a lower worst-case sample bound. As ε becomes small, the relative benefit of sequential sampling can reach orders of magnitude. Consider, for instance, the linear utility function and k=1, ε=0.1, δ=0.1. We can return the first hypothesis after 9,800 examples, whereas the non-sequential algorithm returns the solution only after 565,290 examples. The sample size of the sequential algorithm is still reasonable for k=3, and we expect it not to grow too fast for larger values since the worst-case bound is logarithmic in |H|, i.e., linear in k.

For the second set of experiments, we used the data provided for the KDD cup 1998. The data contains 95,412 records that describe mailings by a veterans organization. Each record contains 481 attributes describing one recipient of a previous mailing. The target fields note whether the person responded and how high his donation to the organization was. Our task was to find large subgroups of recipients that were particularly likely (or unlikely) to respond (we used the attribute “Target B” as target and deleted “Target D”). We discretized all numeric attributes (using five discrete values); our hypothesis space consists of all 4,492 attribute value tests.

FIG. 2 displays the sample sizes required by our sequential sampling algorithm, as well as by the non-sequential sampling algorithm that comes with exactly the same guarantee regarding the quality of the solutions. Note that we use a logarithmic (log₁₀) scale on the y axis. Although it is fair to say that this is a large-scale problem, the sample sizes used by the sequential sampling algorithm are in a reasonable range for all three studied utility functions. Less than 10,000 examples are required when ε is as small as 0.002 for f=g|p−p₀| and f=g²|p−p₀|, and when ε is 0.05 for f=√g|p−p₀|.

The relative benefit of sequential over non-sequential sampling is quite significant. For instance, in FIG. 2a (ε=0.002) the non-sequential algorithm requires over 10⁷ examples (of course, many more than are available), whereas the sequential one still needs less than 10⁴.

6 Discussion and Related Results

Learning algorithms that require a number of examples which can be guaranteed to suffice for finding a nearly optimal hypothesis even in the worst case have early on been criticized as being impractical. Sequential learning techniques have been known in statistics for some time [6]. [3] have introduced sequential sampling techniques into the machine learning context by proposing the “Hoeffding Race” algorithm that combines loop-reversal with adaptive Hoeffding bounds. A general scheme for sequential local search with instance-averaging utility functions has been proposed by Greiner [2].

Sampling techniques are particularly needed in the context of knowledge discovery in databases, where often much more data are available than can be processed. A non-sequential sampling algorithm for KDD has been presented by Toivonen [5]; a sequential algorithm by Domingo et al. [1]. A preliminary version of the algorithm presented in this paper has been discussed in [4]. This preliminary algorithm, however, did not use utility confidence bounds, and its empirical behavior was less favorable than the behavior of the algorithm presented here. Our algorithm was inspired by the local searching algorithm of Greiner [2] but differs from it in a number of ways. The most important difference is that we refer to utility confidence bounds, which makes it possible to handle all utility functions that can be estimated with bounded error, even though they may not be an average across all instances.

In classification learning, error probabilities are clearly the dominating utility criterion. This is probably the reason why all sampling algorithms that have been studied so far are restricted to instance-averaging utility functions. In many areas of machine learning and knowledge discovery (such as association rule and subgroup discovery), instance-averaging utility functions are clearly inappropriate. The sampling algorithm of Domingo et al. [1] allows for utility criteria which are a function (with bounded derivative) of an average over the instances. This, too, does not cover popular utility functions (such as g|p−p₀|) which depend on two averages (g and |p−p₀|) across the instances. Our algorithm is more general and works for all utility criteria for which a confidence interval can be found. We presented a list of instantiations for the most popular utility functions for knowledge discovery tasks and showed that there is no solution for one function. Another minor difference between our algorithm and the one of [1] is that (when the utility confidence bound vanishes) our algorithm can be guaranteed to terminate with certainty (not just with high probability) when it has reached a worst-case sample size bound.

So far, learning and discovery algorithms return the best hypothesis or all hypotheses over a certain utility threshold. Often, in particular in the context of knowledge discovery tasks, a user is interested in being provided with a number of the best hypotheses. Our algorithm returns the n approximately best hypotheses.

The approach that we pursue differs from the (PAC-style) worst-case approach by requiring smaller samples in all cases that are distinct from the worst case (in which all hypotheses are equally good). Instead of operating with smaller samples, it is also possible to work with a fixed-size sample but guarantee a higher quality of the solution if the observed situation differs from this worst case. This is the general idea of shell decomposition bounds and self-bounding learning algorithms.

Although we have discussed our algorithm only in the context of knowledge discovery tasks, it should be noted that the problem which we address is relevant in a much wider context. A learning agent that actively collects data and searches for a hypothesis (perhaps a control policy) which maximizes its utility function has to decide at which point no further improvement can be achieved by collecting more data. The utility function of an intelligent agent will generally be more complicated than an average over the observations. Our sequential sampling algorithm provides a framework for solving such problems.

As it is stated currently, our algorithm represents all considered hypotheses explicitly. It can therefore only be applied practically when the hypothesis space is relatively small. This is the case for most knowledge discovery tasks. The space of all association rules or subgroups over a certain number of attributes (which grows singly exponentially in the number of monomials allowed) is much smaller than, for instance, the space of all decision trees (which grows doubly exponentially). However, most hypothesis spaces possess a symmetric structure which renders it unnecessary to represent all hypotheses explicitly. Although there are 2ⁿ decision trees with n fixed leaf nodes, it is trivial to assign optimal class labels in O(n) steps without representing all 2ⁿ alternatives. Similarly, the histogram of error rates of a set of decision trees or rule sets can be determined in time logarithmic in the number of hypotheses. We are confident that our sampling algorithm can be applied analogously to complex and structured hypothesis spaces without explicit representation of all hypotheses.

By giving worst-case bounds on the sample size (and proving that there is no sampling algorithm for some utility functions), our results also give an indication as to which of the many utility functions appear preferable from a sampling point of view.

REFERENCES

-   [1] C. Domingo, R. Gavaldà, and O. Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Technical Report TR-C131, Dept. de LSI, Politecnica de Catalunya, 1999.
-   [2] Russell Greiner. PALO: A probabilistic hill-climbing algorithm. Artificial Intelligence, 83(1–2), July 1996.
-   [3] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pages 59–66, 1994.
-   [4] T. Scheffer and S. Wrobel. A sequential sampling algorithm for a general class of utility functions. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
-   [5] H. Toivonen. Sampling large databases for association rules. In Proc. VLDB Conference, 1996.
-   [6] A. Wald. Sequential Analysis. Wiley, 1947.
-   [7] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proc. First European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97), pages 78–87, Berlin, 1997, and EP-A-0 887 749.

1. Method for sampling a database for obtaining the probably approximately n best hypotheses having the highest empirical utility of a group of potential hypotheses, comprising the steps of: a) generating all possible hypotheses based on specifications of a user as remaining hypotheses, b) checking whether enough data points were sampled so far to distinguish within the group of potential hypotheses all good looking hypotheses from bad looking hypotheses with sufficient confidence, c) if according to step b) enough data points were sampled so far then continue with step j) otherwise continue with step f), d) sampling one or more data points, e) calculating the utility of all the remaining hypotheses on the basis of the sampled data points, f) in the set of remaining hypotheses determining a set of good looking hypotheses by taking the n hypotheses that currently have the highest utility based on the data points sampled so far, whereas the other hypotheses are added to a set of bad looking hypotheses, g) checking each of the remaining hypotheses wherein i) if the currently considered hypothesis with sufficient probability is sufficiently good when compared to all bad looking hypotheses then outputting the currently considered hypothesis, removing the currently considered hypothesis from the set of remaining hypotheses and from the set of good looking hypotheses, decrementing the number of hypotheses still to be found and if the number of hypotheses still to be found is zero or the number of remaining hypotheses is equal to the number of hypotheses still to be found, then continue with step j), ii) if the currently considered hypothesis with sufficient probability is sufficiently bad when compared with all the good looking hypotheses then removing the currently considered hypothesis from the set of bad looking hypotheses and from the set of remaining hypotheses and if the number of remaining hypotheses is equal to the number of hypotheses still to be found, then continue with step j), h) continue with step a), j) outputting to a user the n best hypotheses having the highest utility of all the remaining hypotheses.

2. Method according to claim 1, wherein in steps a), g)i) and g)ii) the following step k) is performed: k) determining a utility confidence interval given a required maximum probability of error and a sample size based on the observed variance of each hypothesis' utility value.

3. Method according to claim 2, wherein in step a) the following step l) is performed: l) based on step k) determining the size of the utility confidence interval for the current data point sample size and the probability of error as selected by the user and dividing the size of the utility confidence interval by two and by the number of remaining hypotheses, wherein enough data points are sampled if the number as calculated above is smaller than the user selected maximum margin of error divided by two.
4. Method according to claim 2 or 3, wherein in step g)i) the following steps are performed: m) determine a locally allowed error probability based on the probability of error selected by the user divided by two and the number of remaining hypotheses and the smallest data point sample size that makes the error confidence interval according to step k) smaller than or equal to the user selected maximum error margin when determined based on the user selected error probability divided by two and the size of the initial set of hypotheses, n) for each of the bad looking hypotheses, determine the sum of its observed utility value and its error confidence interval size as determined according to step k) for the current data point sample size and the locally allowed error probability according to step m), o) selecting the maximum value of all the sums determined in step n), p) subtracting the user selected maximum error margin from the value selected in step o), q) adding to the number obtained in step p) the size of the error confidence interval as determined according to step k) for the hypothesis currently considered for outputting based on the current data point sample size and the locally allowed error probability, r) if the number determined according to steps m) to q) is no larger than the observed utility of the hypothesis currently considered for outputting and this hypothesis is a good looking one, then this hypothesis is considered sufficiently good with sufficient probability as required in step g)i).
5. Method according to any one of claims 2 to 4, wherein in step g)ii) the following steps are performed: s) determine a locally allowed error probability based on the probability of error selected by the user divided by two and the number of remaining hypotheses and the smallest data point sample size that makes the error confidence interval according to step k) smaller than or equal to the user selected maximum error margin when determined based on the user selected error probability divided by two and the size of the initial set of hypotheses, t) for each good looking hypothesis determine the difference between its observed utility value and its error confidence interval as determined in step k) based on the current data point sample size and the locally allowed error probability as determined in step s), u) selecting the minimum of all the differences determined in step t), v) subtracting from the value selected in step u) the size of the utility confidence interval of the hypothesis considered for removal as determined according to step k) and the locally allowed error probability as determined in step s), w) if the number determined in steps s) to v) is not smaller than the observed utility of the hypothesis considered for removal then this hypothesis is considered sufficiently bad with sufficient probability as required in step g)ii).
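
Purely as an illustration, and not as a restatement of the claims, the following sketch outlines the sampling loop of claim 1. The helper functions (draw_example, emp_utility, conf_interval), the local error budget, and the simplified termination check are our assumptions; the output and removal tests mirror the criteria of claims 4 and 5 only schematically.

```python
def sequential_n_best(draw_example, hypotheses, emp_utility, conf_interval,
                      n, epsilon, delta, worst_case_m):
    """Sketch of the loop of claim 1.  conf_interval(m, err) is assumed to return
    the half-width E(m, err) of the utility confidence interval after m examples."""
    remaining = list(hypotheses)
    output = []
    sample = []
    while n > 0 and len(remaining) > n:
        m = len(sample)
        # steps b)/c): enough data to tell good from bad looking hypotheses apart?
        if m >= worst_case_m or (m > 0 and conf_interval(m, delta / (2 * len(hypotheses))) <= epsilon / 2):
            break
        sample.append(draw_example())                      # step d)
        m += 1
        # steps e)/f): rank by empirical utility, split into good and bad looking
        ranked = sorted(remaining, key=lambda h: emp_utility(h, sample), reverse=True)
        good, bad = ranked[:n], ranked[n:]
        err = delta / (2 * len(remaining) * worst_case_m)  # assumed local error budget
        e = conf_interval(m, err)
        for h in list(remaining):                          # step g)
            u = emp_utility(h, sample)
            if h in good and u - e >= max(emp_utility(b, sample) + e for b in bad) - epsilon:
                output.append(h)                           # g) i): sufficiently good
                remaining.remove(h)
                n -= 1
            elif h in bad and u + e <= min(emp_utility(x, sample) - e for x in good):
                remaining.remove(h)                        # g) ii): sufficiently bad
    # step j): fill up with the empirically best of the remaining hypotheses
    if sample:
        remaining.sort(key=lambda h: emp_utility(h, sample), reverse=True)
    return output + remaining[:n]
```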